Mann–Whitney U Test (Wilcoxon Rank-Sum) — From Scratch with NumPy#

The Mann–Whitney U test is a nonparametric test for comparing two independent samples. Instead of comparing means (like a two-sample t-test), it compares ranks — which makes it useful when data is skewed, heavy-tailed, or ordinal.

What you’ll learn#

  • when to use Mann–Whitney U (and when not to)

  • the hypotheses it tests (what “significant” actually means)

  • how \(U\) is computed (wins + ranks)

  • a low-level implementation in NumPy (including ties)

  • how to visualize the null distribution via permutations

  • how to interpret results + effect sizes

import numpy as np
import plotly.graph_objects as go
import os
import plotly.io as pio

from math import erf, sqrt

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

1) When to use it#

Use Mann–Whitney U when you have:

  • two independent groups (different people, different devices, etc.)

  • a numeric or at least ordinal outcome

  • a reason to avoid a normality assumption (skew, heavy tails, small samples)

Common situations:

  • A/B testing for skewed metrics (latency, time-on-task)

  • small samples where “normal enough” is questionable

  • ordinal ratings (Likert) treated as ordered

Not appropriate when:

  • observations are paired / repeated measures (use the Wilcoxon signed-rank test instead)

  • you have more than two groups (use Kruskal–Wallis, then post-hoc tests)

  • data are not independent (clustered/longitudinal without modeling that structure)

2) What hypothesis does it test?#

Let \(X\) be a random draw from group A and \(Y\) from group B.

A convenient way to express the null hypothesis is:

\[ H_0: \; P(X > Y) + \tfrac12 P(X = Y) = 0.5 \]

So under \(H_0\), a randomly drawn value from A is equally likely to be larger or smaller than a randomly drawn value from B (with ties split evenly).

Alternatives:

  • two-sided: the distributions differ in either direction

  • greater: A tends to have larger values than B

  • less: A tends to have smaller values than B

Important nuance:

  • Mann–Whitney detects differences in distributions.

  • If the two groups have similar shapes/spreads and mostly differ by a shift, a significant result is often interpreted as a median shift.
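The null statement is easy to check by simulation. Below is a minimal sketch (the exponential distribution and the sample size are arbitrary choices): when both samples come from the same distribution, the empirical estimate of \(P(X > Y) + \tfrac12 P(X = Y)\) lands near 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)

# Both "groups" come from the same skewed distribution, so H0 holds by construction.
x = rng.exponential(scale=2.0, size=5_000)
y = rng.exponential(scale=2.0, size=5_000)

# Empirical P(X > Y) + 0.5 * P(X = Y), using independent pairs (x[i], y[i]).
a_hat = (x > y).mean() + 0.5 * (x == y).mean()
print(round(a_hat, 3))  # close to 0.5
```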

3) Example data (skewed metric)#

We’ll simulate an A/B test on a skewed metric (think “response time”). Group B’s values will be slightly smaller (faster), while keeping a similar shape.

rng = np.random.default_rng(7)

n_a = 25
n_b = 28

group_a = rng.lognormal(mean=0.10, sigma=0.55, size=n_a)
group_b = rng.lognormal(mean=-0.30, sigma=0.55, size=n_b)

{
    "n": (n_a, n_b),
    "mean": (group_a.mean(), group_b.mean()),
    "median": (np.median(group_a), np.median(group_b)),
}
{'n': (25, 28),
 'mean': (0.9872234195640509, 0.7597615779329113),
 'median': (0.9504917274585137, 0.7089964448880488)}
fig = go.Figure()
fig.add_trace(
    go.Violin(
        y=group_a,
        name="A (control)",
        box_visible=True,
        meanline_visible=True,
        points="all",
        jitter=0.25,
        scalemode="width",
        marker=dict(size=6, opacity=0.7, color="#636EFA"),
        line=dict(color="#636EFA"),
    )
)
fig.add_trace(
    go.Violin(
        y=group_b,
        name="B (treatment)",
        box_visible=True,
        meanline_visible=True,
        points="all",
        jitter=0.25,
        scalemode="width",
        marker=dict(size=6, opacity=0.7, color="#EF553B"),
        line=dict(color="#EF553B"),
    )
)
fig.update_layout(
    title="Skewed metric in two independent groups",
    yaxis_title="Response time (arbitrary units)",
    width=760,
    height=450,
)
fig.show()

4) Core idea: replace values by ranks#

The test works on ranks:

  1. Pool the two samples.

  2. Sort them.

  3. Assign ranks \(1,2,\ldots,n\).

  4. Add up the ranks inside each group.

If there are ties, we assign the tied values their average rank.

def rankdata_average(values):
    """Rank values in ascending order, giving tied values their average rank."""
    values = np.asarray(values)
    order = np.argsort(values, kind="mergesort")  # stable sort
    sorted_vals = values[order]

    # Each run of tied values spans ranks (start + 1) .. end; give all of them the mean.
    ranks_sorted = np.empty_like(sorted_vals, dtype=float)
    _, first, counts = np.unique(sorted_vals, return_index=True, return_counts=True)
    for start, count in zip(first, counts):
        end = start + count
        avg_rank = 0.5 * ((start + 1) + end)
        ranks_sorted[start:end] = avg_rank

    # Scatter the ranks back into the original (unsorted) positions.
    ranks = np.empty_like(ranks_sorted)
    ranks[order] = ranks_sorted
    return ranks


demo = np.array([10, 10, 20, 15])
rankdata_average(demo)
array([1.5, 1.5, 4. , 3. ])

5) From rank sums to \(U\) (the “wins” statistic)#

Let:

  • \(n_1\) = size of group A

  • \(n_2\) = size of group B

  • \(R_1\) = sum of ranks for group A (after pooling + ranking)

Then:

\[ U_1 = R_1 - \frac{n_1(n_1+1)}{2} \]

\(U_1\) can be interpreted as:

  • the number of (A,B) pairs where \(A > B\)

  • plus half the number of ties

And:

\[ U_2 = n_1 n_2 - U_1 \]

For a two-sided test, many references report \(U = \min(U_1, U_2)\).

def u_from_pairs(x, y):
    x = np.asarray(x)
    y = np.asarray(y)
    gt = (x[:, None] > y[None, :]).sum()
    eq = (x[:, None] == y[None, :]).sum()
    return gt + 0.5 * eq


def mann_whitney_u_from_ranks(x, y):
    x = np.asarray(x)
    y = np.asarray(y)
    if x.size == 0 or y.size == 0:
        raise ValueError("Both samples must be non-empty")

    n1 = x.size
    n2 = y.size

    pooled = np.concatenate([x, y])
    ranks = rankdata_average(pooled)

    R1 = ranks[:n1].sum()
    U1 = R1 - n1 * (n1 + 1) / 2
    U2 = n1 * n2 - U1
    return U1, U2, ranks


x_small = np.array([3, 1, 5])
y_small = np.array([2, 4, 6])

U1_pair = u_from_pairs(x_small, y_small)
U1_rank, U2_rank, _ = mann_whitney_u_from_ranks(x_small, y_small)

(U1_pair, U1_rank, U2_rank)
(3.0, 3.0, 6.0)
U1, U2, ranks = mann_whitney_u_from_ranks(group_a, group_b)
U = min(U1, U2)
(U1, U2, U)
(476.0, 224.0, 224.0)
pooled = np.concatenate([group_a, group_b])
labels = np.array(["A"] * len(group_a) + ["B"] * len(group_b))
ranks = rankdata_average(pooled)

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=ranks[labels == "A"],
        y=pooled[labels == "A"],
        mode="markers",
        name="A",
        marker=dict(size=8, opacity=0.85, color="#636EFA"),
    )
)
fig.add_trace(
    go.Scatter(
        x=ranks[labels == "B"],
        y=pooled[labels == "B"],
        mode="markers",
        name="B",
        marker=dict(size=8, opacity=0.85, color="#EF553B"),
    )
)
fig.update_layout(
    title="Pooled ranks: lower values get lower ranks",
    xaxis_title="Rank (1 = smallest)",
    yaxis_title="Value",
    width=760,
    height=450,
)
fig.show()

6) The p-value: a null distribution for \(U\)#

Under \(H_0\), the two groups are exchangeable: the labels “A” and “B” shouldn’t matter.

So we can build a null distribution by:

  1. pooling the observed values

  2. randomly re-assigning \(n_1\) of them to “A” (the rest to “B”)

  3. computing \(U_1\) each time

The p-value is the proportion of permutations with a statistic at least as extreme as what we observed.

def mw_u_permutation_test(x, y, alternative="two-sided", n_permutations=20_000, rng=None):
    x = np.asarray(x)
    y = np.asarray(y)
    if x.size == 0 or y.size == 0:
        raise ValueError("Both samples must be non-empty")

    if rng is None:
        rng = np.random.default_rng()

    n1 = x.size
    n2 = y.size
    n = n1 + n2

    pooled = np.concatenate([x, y])
    ranks = rankdata_average(pooled)

    obs_R1 = ranks[:n1].sum()
    obs_U1 = obs_R1 - n1 * (n1 + 1) / 2

    # Each row of `keys` is i.i.d. uniform noise; the indices of its n1 smallest
    # entries form a uniformly random size-n1 subset, i.e. a random relabeling.
    keys = rng.random((n_permutations, n))
    idx = np.argpartition(keys, kth=n1 - 1, axis=1)[:, :n1]
    perm_R1 = ranks[idx].sum(axis=1)
    perm_U1 = perm_R1 - n1 * (n1 + 1) / 2

    if alternative == "greater":
        p = (np.sum(perm_U1 >= obs_U1) + 1) / (n_permutations + 1)
    elif alternative == "less":
        p = (np.sum(perm_U1 <= obs_U1) + 1) / (n_permutations + 1)
    elif alternative == "two-sided":
        lo = (np.sum(perm_U1 <= obs_U1) + 1) / (n_permutations + 1)
        hi = (np.sum(perm_U1 >= obs_U1) + 1) / (n_permutations + 1)
        p = min(1.0, 2 * min(lo, hi))
    else:
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")

    return obs_U1, perm_U1, p
rng_perm = np.random.default_rng(7)

obs_U1, perm_U1, p_perm = mw_u_permutation_test(
    group_a,
    group_b,
    alternative="two-sided",
    n_permutations=20_000,
    rng=rng_perm,
)

{
    "U1": obs_U1,
    "U2": len(group_a) * len(group_b) - obs_U1,
    "p_value (permutation)": p_perm,
}
{'U1': 476.0, 'U2': 224.0, 'p_value (permutation)': 0.024098795060246987}
mu_null = len(group_a) * len(group_b) / 2
d = abs(obs_U1 - mu_null)

fig = go.Figure()
fig.add_trace(go.Histogram(x=perm_U1, nbinsx=40, name="permuted U1", marker=dict(color="rgba(99,110,250,0.65)")))
fig.add_vline(
    x=obs_U1,
    line_width=3,
    line_color="black",
    annotation_text=f"observed U1 = {obs_U1:.1f}",
    annotation_position="top",
)
fig.add_vline(
    x=mu_null,
    line_width=2,
    line_dash="dash",
    line_color="gray",
    annotation_text="null mean",
    annotation_position="top right",
)
fig.add_vrect(
    x0=min(perm_U1),
    x1=mu_null - d,
    fillcolor="rgba(239,85,59,0.18)",
    line_width=0,
)
fig.add_vrect(
    x0=mu_null + d,
    x1=max(perm_U1),
    fillcolor="rgba(239,85,59,0.18)",
    line_width=0,
)
fig.update_layout(
    title=f"Null distribution of U1 under H0 (permutations), p ≈ {p_perm:.4f}",
    xaxis_title="U1",
    yaxis_title="Count",
    width=760,
    height=450,
)
fig.show()

7) Normal approximation (large samples)#

For larger samples, \(U_1\) is well-approximated by a normal distribution.

Mean under \(H_0\):

\[ \mu_U = \frac{n_1 n_2}{2} \]

Variance under \(H_0\) (no ties):

\[ \sigma_U^2 = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12} \]

With ties, we apply a standard tie correction to the variance.

Then we compute a z-score (often with a continuity correction) and get a p-value from the normal CDF.

def normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))


def mw_u_normal_approx(x, y, alternative="two-sided", continuity=True):
    x = np.asarray(x)
    y = np.asarray(y)
    if x.size == 0 or y.size == 0:
        raise ValueError("Both samples must be non-empty")

    n1 = x.size
    n2 = y.size
    n = n1 + n2

    pooled = np.concatenate([x, y])
    ranks = rankdata_average(pooled)
    R1 = ranks[:n1].sum()
    U1 = R1 - n1 * (n1 + 1) / 2

    mu = n1 * n2 / 2
    # Tie correction: sum of (t^3 - t) over groups of t tied values shrinks the variance.
    _, counts = np.unique(pooled, return_counts=True)
    tie_term = np.sum(counts**3 - counts)
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    sigma = sqrt(var)

    if sigma == 0:
        return U1, 0.0, 1.0

    cc = 0.0
    if continuity:
        if alternative == "greater":
            cc = -0.5
        elif alternative == "less":
            cc = 0.5
        elif alternative == "two-sided":
            cc = -0.5 if U1 > mu else 0.5 if U1 < mu else 0.0
        else:
            raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")

    z = (U1 - mu + cc) / sigma
    cdf = normal_cdf(z)

    if alternative == "greater":
        p = 1 - cdf
    elif alternative == "less":
        p = cdf
    else:
        p = 2 * min(cdf, 1 - cdf)

    return U1, z, p
U1_norm, z, p_norm = mw_u_normal_approx(group_a, group_b, alternative="two-sided", continuity=True)

{
    "U1": U1_norm,
    "z": z,
    "p_value (normal approx)": p_norm,
}
{'U1': 476.0,
 'z': 2.2360857240006173,
 'p_value (normal approx)': 0.025346156404938203}
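Ties are where the variance correction matters. As a quick self-contained sketch (the sample sizes and the one-decimal rounding are arbitrary choices), heavy rounding creates many ties, and the tie-corrected variance comes out strictly smaller than the no-ties formula — so skipping the correction would understate \(|z|\):

```python
import numpy as np

rng = np.random.default_rng(0)

n1, n2 = 30, 30
n = n1 + n2

# Rounding to one decimal forces many tied values in the pooled sample.
pooled = np.round(rng.normal(size=n), 1)

_, counts = np.unique(pooled, return_counts=True)
tie_term = np.sum(counts**3 - counts)  # sum of (t^3 - t) over tied groups

var_no_ties = n1 * n2 * (n + 1) / 12
var_ties = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))

print(var_no_ties, var_ties)  # corrected variance is smaller when ties exist
```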

8) Interpreting the result#

What a small p-value means:

  • If \(H_0\) were true (no tendency for A to be larger than B), then observing a \(U\) at least this extreme would be rare.

  • So a small p-value is evidence against \(H_0\).

What it does not mean:

  • It does not tell you the probability that \(H_0\) is true.

  • It does not tell you how large the difference is.

A simple, interpretable effect size is the probability of superiority:

\[ A = P(X > Y) + \tfrac12 P(X = Y) = \frac{U_1}{n_1 n_2} \]

  • \(A = 0.5\) suggests no tendency for A to be larger than B.

  • \(A = 0.65\) means a random A observation exceeds a random B observation about 65% of the time (counting ties as half).

n1 = len(group_a)
n2 = len(group_b)

A = obs_U1 / (n1 * n2)
cliffs_delta = 2 * A - 1
r_effect = abs(z) / sqrt(n1 + n2)

{
    "probability_of_superiority (A)": A,
    "Cliff's delta (2A-1)": cliffs_delta,
    "r (|z|/sqrt(n))": r_effect,
}
{'probability_of_superiority (A)': 0.68,
 "Cliff's delta (2A-1)": 0.3600000000000001,
 'r (|z|/sqrt(n))': 0.3071499960863374}
fig = go.Figure()
fig.add_trace(go.Bar(x=[A], y=["A"], orientation="h", marker=dict(color="#00CC96")))
fig.add_vline(x=0.5, line_dash="dash", line_color="gray")
fig.add_annotation(x=A, y=0, text=f"{A:.2f}", showarrow=False, yshift=14)
fig.update_layout(
    title="Common-language effect size: probability of superiority",
    xaxis=dict(range=[0, 1], title="P(A > B) + 0.5·P(tie)"),
    yaxis=dict(showticklabels=False),
    width=760,
    height=260,
    showlegend=False,
)
fig.show()

9) A cautionary example: “same median” does not imply “no difference”#

Mann–Whitney U is often described as a test for a median shift. That interpretation is most defensible when the two distributions have similar shapes.

Here we build two samples with exactly the same sample median, but different shapes. It’s still possible for Mann–Whitney U to be significant, because the test is about relative ordering, not “median equality”.

rng_shape = np.random.default_rng(7)

n = 200
x = rng_shape.normal(0, 1, size=n)
y = rng_shape.exponential(scale=1.0, size=n) - np.log(2)  # Exp(1) has median ln 2
y = y - np.median(y) + np.median(x)  # force the sample medians to match exactly

U1_xy, z_xy, p_xy = mw_u_normal_approx(x, y, alternative="two-sided", continuity=True)

{
    "sample_median_x": float(np.median(x)),
    "sample_median_y": float(np.median(y)),
    "p_value (normal approx)": p_xy,
}
{'sample_median_x': -0.1307394221048977,
 'sample_median_y': -0.1307394221048977,
 'p_value (normal approx)': 0.010240851796336825}
fig = go.Figure()
fig.add_trace(
    go.Violin(
        y=x,
        name="x (Normal)",
        box_visible=True,
        meanline_visible=True,
        points=False,
        line=dict(color="#636EFA"),
        fillcolor="rgba(99,110,250,0.35)",
    )
)
fig.add_trace(
    go.Violin(
        y=y,
        name="y (Shifted Exponential)",
        box_visible=True,
        meanline_visible=True,
        points=False,
        line=dict(color="#EF553B"),
        fillcolor="rgba(239,85,59,0.35)",
    )
)
fig.update_layout(
    title=f"Different shapes with the same sample median (p ≈ {p_xy:.4f})",
    yaxis_title="Value",
    width=760,
    height=450,
)
fig.show()

10) Pitfalls + diagnostics#

  • Independence is crucial: if measurements are paired/repeated, switch tests.

  • Ties: common in discrete/rounded data. Average ranks and a tie-corrected variance handle them, but with heavy ties a permutation test is safer.

  • Report an effect size: a significant p-value can still correspond to a small \(A\) (small practical difference).

  • Distribution shape matters: if spreads/skews differ a lot, interpret the result as “distribution difference”, not “median difference”.

  • Power: like any test, Mann–Whitney can miss small effects with small samples.
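The power caveat is easy to quantify. Below is a minimal simulation sketch (normal data, a pure location shift, 25 observations per group, and 1,000 simulated experiments are all arbitrary choices; the helper names are ours, and the inline p-value uses the no-ties normal approximation for speed):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)


def mw_p_two_sided(x, y):
    # Two-sided p-value via the normal approximation (no tie correction;
    # adequate for continuous data, where exact ties have probability zero).
    n1, n2 = x.size, y.size
    u1 = (x[:, None] > y[None, :]).sum() + 0.5 * (x[:, None] == y[None, :]).sum()
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    cdf = 0.5 * (1 + erf((u1 - mu) / (sigma * sqrt(2))))
    return 2 * min(cdf, 1 - cdf)


def estimated_power(shift, n_per_group=25, n_sims=1_000, alpha=0.05):
    # Fraction of simulated experiments where the test rejects at level alpha.
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(shift, 1.0, n_per_group)
        if mw_p_two_sided(a, b) < alpha:
            hits += 1
    return hits / n_sims


for shift in (0.0, 0.5, 1.0):
    print(f"shift={shift:.1f}  power≈{estimated_power(shift):.2f}")
```

At shift 0 the rejection rate should hover near \(\alpha\) (the type I error rate), and it climbs toward 1 as the shift grows.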

11) Practical usage (SciPy)#

In real projects, you’ll usually call a library implementation. SciPy’s mannwhitneyu returns the U statistic (for the first sample you pass) and a p-value.

Below is an optional sanity check against SciPy’s asymptotic (normal-approx) result.

try:
    from scipy.stats import mannwhitneyu

    res = mannwhitneyu(group_a, group_b, alternative="two-sided", method="asymptotic")
    check = {
        "ours_U1": float(U1_norm),
        "ours_p_value": float(p_norm),
        "scipy_U1": float(res.statistic),
        "scipy_p_value": float(res.pvalue),
    }
except Exception as e:
    check = f"SciPy check skipped: {e}"

check

Exercises#

  1. Implement \(U_1\) using pairwise comparisons and show it matches the rank-based formula (including ties).

  2. Use the permutation approach to approximate a p-value for different sample sizes and see how it stabilizes as n_permutations increases.

  3. Simulate power: pick a small shift in location and estimate how often Mann–Whitney rejects at \(\alpha=0.05\).

References#

  • SciPy documentation: scipy.stats.mannwhitneyu

  • Conover, Practical Nonparametric Statistics

  • Lehmann, Nonparametrics: Statistical Methods Based on Ranks